Introduction to Deep Reinforcement Learning (DRL)

Deep Reinforcement Learning (DRL) merges the high-dimensional representation capabilities of Deep Neural Networks with the optimal control framework of Reinforcement Learning. Unlike supervised or unsupervised learning, DRL agents learn through trial-and-error interaction within a dynamic environment, making sequential decisions without immediate, explicit labels. This integration allows agents to handle complex, raw inputs (like pixel data) directly.

1. The DRL Learning Paradigm

The RL agent operates in a continuous loop: observing the environment State ($S_t$), performing an Action ($A_t$), and receiving a potentially sparse or delayed scalar Reward ($R_{t+1}$). The primary challenge is the credit assignment problem: determining which past actions were responsible for a future reward signal.
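The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, and the always-move-right policy are hypothetical names invented for this example, and the `reset`/`step` interface is only an assumed convention. The reward is deliberately sparse (nonzero only at the final step) to make the credit assignment problem visible.

```python
class CorridorEnv:
    """Hypothetical toy environment: a 1-D corridor with cells 0..length-1.

    The reward is sparse: +1.0 only when the agent reaches the rightmost cell.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back in cell 0 and return the initial state S_0."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply A_t (0 = left, 1 = right) and return (S_{t+1}, R_{t+1}, done)."""
        delta = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.state = max(0, min(self.length - 1, self.state + delta))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return a list of (S_t, A_t, R_{t+1}) tuples."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent chooses A_t from S_t
        next_state, reward, done = env.step(action)  # environment responds
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


# A trivial policy that always moves right, used only to exercise the loop.
episode = run_episode(CorridorEnv(length=5), lambda state: 1)
for s, a, r in episode:
    print(f"S_t={s}  A_t={a}  R_t+1={r}")
```

Only the last transition carries a nonzero reward, yet every earlier move to the right contributed to it. Deciding how much credit each of those earlier actions deserves is exactly the credit assignment problem.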

2. The Optimization Objective

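The ultimate goal is to discover an optimal strategy, or policy ($\pi^*$), that maximizes the expected value of the cumulative discounted return ($G_t$). A policy maps states to actions; in general it is a probability distribution over actions given a state, $\pi(a \mid s)$, with a deterministic mapping as a special case. The discount factor ($\gamma \in [0, 1]$) determines how much the agent values immediate rewards relative to rewards expected far in the future. For continuing (infinite-horizon) tasks, choosing $\gamma < 1$ keeps the sum finite whenever rewards are bounded.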

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
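For a finite episode, the sum can be evaluated directly. The sketch below is illustrative only: the function names are made up for this example, and the reward list is assumed to hold $R_{t+1}, R_{t+2}, \dots$ up to the end of the episode. It works backward through the episode using the recursive form $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from factoring $\gamma$ out of every term after the first.

```python
def discounted_return(rewards, gamma):
    """Return G_t for a finite list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Walk backward so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def returns_per_step(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one episode in a single backward pass."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# A sparse-reward episode: the only reward arrives on the fourth step.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))
print(returns_per_step(rewards, gamma=0.9))
```

With `gamma=0.9`, the return from the first step is $0.9^3 \approx 0.729$, because the single reward is discounted three times before it reaches $t = 0$.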

The Fundamental RL Cycle

An illustration of the Markov Decision Process (MDP) framework. The Agent's policy dictates the action ($A_t$) based on the current state ($S_t$), leading the Environment to transition to a new state ($S_{t+1}$) and provide a reward ($R_{t+1}$).

[Figure: The Reinforcement Learning Cycle: Agent, Environment, State, Action, Reward]

Question 1

How does the DRL agent receive feedback from the environment?

Explicit labels/targets

Backpropagation through time

Scalar reward signal

Labeled demonstration data

Question 2

What does the policy ($\pi$) mathematically represent?

The predicted total reward

A distribution over actions given a state

The probability of transitioning to a new state

The error between predicted and actual returns

Challenge: The Discount Factor

Analyzing the Temporal Horizon.

Consider two scenarios:
1. $\gamma = 0$
2. $\gamma \approx 1$

Describe the agent's behavioral preference in each case regarding the timeline of rewards.

Step 1

How does the choice of $\gamma$ affect the policy's horizon?

Solution:
If $\gamma = 0$, the agent is myopic (shortsighted): $G_t$ reduces to the immediate reward $R_{t+1}$, so the agent ignores all future consequences. If $\gamma \approx 1$, the agent is far-sighted: distant rewards are weighted almost as heavily as immediate ones (exactly equally only at $\gamma = 1$), so the agent plans over a very long horizon.
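A small numerical comparison makes the two regimes concrete. The sketch below is a hedged illustration: the two reward sequences, the names `SOON` and `LATER`, and the helper functions are all invented for this example. It also computes $1/(1-\gamma)$, the sum of the geometric weights $\gamma^k$, which is often used as a rough rule of thumb for the effective planning horizon when $\gamma < 1$.

```python
def discounted_return(rewards, gamma):
    """Return the discounted sum of a finite reward list [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Sum of the weights gamma**k over k >= 0, i.e. 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the effective horizon is finite only for 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


# Two hypothetical options: a small reward now versus a large reward later.
SOON = [1.0, 0.0, 0.0, 0.0, 0.0]
LATER = [0.0, 0.0, 0.0, 0.0, 10.0]


def preferred_option(gamma):
    """Name the option with the higher discounted return (ties go to 'soon')."""
    g_soon = discounted_return(SOON, gamma)
    g_later = discounted_return(LATER, gamma)
    return "soon" if g_soon >= g_later else "later"


for gamma in (0.0, 0.5, 0.99):
    print(
        f"gamma={gamma}: prefers '{preferred_option(gamma)}', "
        f"effective horizon ~ {effective_horizon(gamma):.1f} steps"
    )
```

At $\gamma = 0$ the delayed reward is worth nothing, so the small immediate reward wins. At $\gamma = 0.99$ the delayed reward of 10 is still worth about $10 \cdot 0.99^4 \approx 9.6$, so waiting is preferred.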